
zonal: guard unbounded allocation in stats(return_type='xarray.DataArray') (#2523) #2533

Merged
brendancol merged 1 commit into main from deep-sweep-security-zonal-2026-05-27-01 on May 28, 2026


@brendancol

Contributor

Summary

Closes #2523. Found by deep-sweep-security-zonal-2026-05-27.

Test plan

  • New unit test test_stats_dataarray_return_type_memory_guard_2523 mocks _available_memory_bytes and asserts both the guard fires and the normal path still produces a correct xr.DataArray
  • Full xrspatial/tests/test_zonal.py suite passes (126 tests)

…ray') (#2523)

The numpy backend's xarray.DataArray return path allocated
`np.full((n_stats, values.size), nan)` (float64) with no memory check.
`n_stats` is user-controlled via the `stats_funcs` dict and `values.size`
grows with the input raster, so the working buffer scales linearly with
both. With the default 8 stats, a 20000x20000 input requires ~25.6 GB on
top of the input rasters.
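
The ~25.6 GB figure follows directly from the buffer shape. A quick check of the arithmetic, assuming float64 elements (8 bytes each) and decimal gigabytes (1 GB = 10^9 bytes):

```python
# Size of the (n_stats, values.size) float64 buffer described above.
n_stats = 8                      # default number of stats, per the commit message
height, width = 20_000, 20_000   # raster dimensions from the example
bytes_per_float64 = 8

required_bytes = n_stats * height * width * bytes_per_float64
required_gb = required_bytes / 1e9  # decimal GB

print(f"{required_bytes:,} bytes = {required_gb:.1f} GB")
# prints "25,600,000,000 bytes = 25.6 GB"
```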

Add `_check_stats_dataarray_memory(n_stats, values_shape)` that reuses
the existing `_available_memory_bytes()` helper and raises `MemoryError`
when `n_stats * H * W * 8` exceeds 50% of available RAM, with a message
pointing at the user-tunable knobs (`stats_funcs`, raster size,
`return_type='pandas.DataFrame'`).
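
A minimal sketch of what such a guard might look like, based only on the description above. The function names `check_stats_dataarray_memory_sketch` and `available_memory_bytes_stub`, the `available_fn` parameter, and the exact message wording are assumptions made for illustration; the merged code in `xrspatial/zonal.py` may differ.

```python
def available_memory_bytes_stub():
    """Hypothetical stand-in for the real helper; pretends 16 GiB are free."""
    return 16 * 1024**3


def check_stats_dataarray_memory_sketch(
    n_stats, values_shape, available_fn=available_memory_bytes_stub
):
    """Raise MemoryError if the (n_stats, H*W) float64 buffer would need
    more than 50% of available RAM. Illustrative only."""
    n_cells = 1
    for s in values_shape:
        n_cells *= int(s)

    required = int(n_stats) * n_cells * 8  # 8 bytes per float64 element
    available = available_fn()

    if required > 0.5 * available:
        raise MemoryError(
            "stats(return_type='xarray.DataArray') needs "
            f"~{required / 1e9:.2f} GB but only ~{available / 1e9:.2f} GB "
            "is available. Reduce stats_funcs, use a smaller raster, or "
            "use return_type='pandas.DataFrame'."
        )


# A small raster passes silently; the 20000x20000 case is rejected
# before any allocation would happen.
check_stats_dataarray_memory_sketch(8, (100, 100))
try:
    check_stats_dataarray_memory_sketch(8, (20_000, 20_000))
except MemoryError as exc:
    print(exc)
```

Passing the memory probe in as a parameter is a choice made for this sketch so that it can be exercised without patching; the PR's test plan instead mocks `_available_memory_bytes`.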

Pattern matches the per-module unbounded-allocation guards added in
#1262 (cost_distance), #1287 (kde), #1288 (mahalanobis), #1291
(multispectral), #1295 (resample), #1296 (sieve), and others.

Found by deep-sweep-security-zonal-2026-05-27.
github-actions bot added the performance label (PR touches performance-sensitive code) on May 27, 2026
@brendancol

Contributor Author

PR Review: zonal: guard unbounded allocation in stats(return_type='xarray.DataArray')

Blockers

None.

Suggestions

None.

Nits

  • xrspatial/zonal.py:2025-2027: n_cells = 1; for s in values_shape: n_cells *= int(s) works fine, but int(np.prod(values_shape)) would match the style of _regions_dask (line 2047). Cosmetic.
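
The two spellings in this nit compute the same cell count. A small sketch comparing the explicit loop with a single product call; `math.prod` stands in for `np.prod` here so the snippet needs only the standard library, which is an assumption of this example rather than what the PR or `_regions_dask` uses:

```python
import math

values_shape = (20_000, 20_000)  # example raster shape

# Explicit loop, as in the PR.
n_cells_loop = 1
for s in values_shape:
    n_cells_loop *= int(s)

# Single product call, analogous in spirit to int(np.prod(values_shape)).
n_cells_prod = math.prod(int(s) for s in values_shape)

print(n_cells_loop == n_cells_prod, n_cells_loop)
# prints "True 400000000"
```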

What looks good

  • Guard runs before the np.full(...) call on line 506, so it fires before the allocation, not after.
  • Reuses the existing _available_memory_bytes() helper and the required > 0.5 * available threshold. Matches sieve.py, mahalanobis.py, dasymetric.py, and _regions_dask.
  • Error message names the call site (stats(return_type='xarray.DataArray')), reports required vs available GB, and lists three concrete mitigations: reduce stats_funcs, use a smaller raster, or switch to return_type='pandas.DataFrame'.
  • Regression test mocks _available_memory_bytes to a 100-byte budget, asserts the guard fires, and verifies the normal path still produces a correct xr.DataArray.
  • Other backends do not hit this code path. _stats_cupy and the dask backends always return a DataFrame, and the top-level stats() only assembles the xarray result after the numpy backend has returned the (n_stats, H, W) ndarray, so the guard sits in the right place.
  • Sweep state CSV updated.
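
The regression-test pattern described above (patch the memory probe to a tiny budget, assert the guard fires, then confirm the unpatched path still works) can be sketched as follows. Everything here is a toy stand-in: `fake_zonal`, `guarded_alloc`, and `run_guard_test` are hypothetical names, and the "allocation" is a plain nested list rather than the real `np.full(...)` call.

```python
import types
from unittest import mock

# Hypothetical stand-in for the module under test; not xrspatial itself.
fake_zonal = types.SimpleNamespace()
fake_zonal._available_memory_bytes = lambda: 16 * 1024**3  # pretend 16 GiB free


def guarded_alloc(n_stats, shape):
    """Toy guarded path: check the budget first, then 'allocate'."""
    n_cells = shape[0] * shape[1]
    required = n_stats * n_cells * 8  # float64 bytes
    if required > 0.5 * fake_zonal._available_memory_bytes():
        raise MemoryError(f"need {required} bytes, budget too small")
    return [[float("nan")] * n_cells for _ in range(n_stats)]


def run_guard_test():
    # 1. With a synthetic 100-byte budget the guard must fire:
    #    8 stats * 16 cells * 8 bytes = 1024 bytes > 50 bytes.
    with mock.patch.object(fake_zonal, "_available_memory_bytes", return_value=100):
        try:
            guarded_alloc(8, (4, 4))
        except MemoryError:
            fired = True
        else:
            fired = False

    # 2. Once the patch is undone, the normal path must still succeed.
    result = guarded_alloc(8, (4, 4))
    return fired, (len(result), len(result[0]))


print(run_guard_test())
# prints "(True, (8, 16))"
```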

Checklist

  • Guard fires before the allocation
  • Error message is specific and actionable
  • Threshold pattern matches sibling modules (cost_distance, kde, mahalanobis, dasymetric, sieve)
  • Regression test with synthetic limit
  • pytest xrspatial/tests/test_zonal.py -x -q passes (126/126)

brendancol merged commit efe5063 into main on May 28, 2026
9 checks passed

Labels

performance PR touches performance-sensitive code


Development

Successfully merging this pull request may close these issues.

zonal.stats: unbounded allocation with return_type='xarray.DataArray'

1 participant